Linguistically Informed and Corpus Informed Morphological Analysis of Arabic
Abstract
Standard English PoS-taggers generally involve tag assignment (via dictionary lookup, etc.) followed by tag disambiguation (via a context model, e.g. PoS n-grams or Brill transformations). We want to PoS-tag our Arabic corpus, but evaluation of existing PoS-taggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis. Tag assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes and postclitics). Implementing Arabic morphology faces many challenges: the rich “root-and-pattern” non-concatenative (or nonlinear) morphology and the highly complex word-formation process of roots and patterns, especially if one or two long vowels are among the root letters. Further challenges come from the orthographic issues of Arabic, such as short vowels ( َ ُ ِ ), Hamzah (ء إ أ ؤ ئ), Taa’ Marboutah (ة) and Ha’ (ه), Ya’ (ي) and Alif Maksorah (ى), Shaddah ( ّ ) or gemination, and Maddah (آ) or extension, which is a compound letter of Hamzah and Alif ( اأ ). Our morphological analyser uses linguistic knowledge of the language as well as corpora to verify the linguistic information. To understand the problem, we started by analysing fifteen established Arabic language dictionaries, to build a broad-coverage lexicon which contains not only roots and single words but also multi-word expressions, idioms, collocations requiring special part-of-speech assignment, and words with special part-of-speech tags.
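The five-part decomposition with one subtag per part can be pictured as a small data structure. The following is only an illustrative sketch, not the analyser described in this abstract: the class name `WordAnalysis`, the subtag labels (`CONJ`, `FUT`, and so on) and the hand-segmented example word are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A morpheme is a (surface form, subtag) pair.
# The subtag labels used below are hypothetical, not the authors' tagset.
Morpheme = Tuple[str, str]


@dataclass
class WordAnalysis:
    """Five-part decomposition of one word token."""
    proclitics: List[Morpheme] = field(default_factory=list)
    prefixes: List[Morpheme] = field(default_factory=list)
    stem: Morpheme = ("", "")
    suffixes: List[Morpheme] = field(default_factory=list)
    postclitics: List[Morpheme] = field(default_factory=list)

    def parts(self) -> List[Morpheme]:
        """All morphemes in written order."""
        return [*self.proclitics, *self.prefixes, self.stem,
                *self.suffixes, *self.postclitics]

    def surface(self) -> str:
        """Concatenate the surface forms back into the word."""
        return "".join(form for form, _ in self.parts())

    def tag(self) -> str:
        """Join the subtags into one composite tag."""
        return "+".join(subtag for _, subtag in self.parts())


# Hand-built example: "and they will write it" (undiacritised).
example = WordAnalysis(
    proclitics=[("و", "CONJ"), ("س", "FUT")],
    prefixes=[("ي", "IMPF_PREFIX")],
    stem=("كتب", "VERB_STEM"),
    suffixes=[("ون", "SUBJ_MASC_PL")],
    postclitics=[("ها", "OBJ_PRON")],
)

print(example.surface())
print(example.tag())
```

Keeping proclitics, prefixes, suffixes and postclitics as lists reflects the point made above that a word may carry several of each, so the composite tag can contain more than five subtags.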
The next stage of research was a detailed analysis and classification of Arabic language roots to address the “tail” of hard cases for existing morphological analysers, together with an analysis of the roots, the word-root combinations and the coverage of each root category in the Qur’an and in the word-root information stored in our lexicon. From authoritative Arabic grammar books, we extracted and generated comprehensive lists of affixes, clitics and patterns. These lists were then cross-checked by analysing the words of three corpora: the Qur’an, the Corpus of Contemporary Arabic and the Penn Arabic Treebank (as well as our lexicon, considered as a fourth cross-check corpus). We also developed a novel algorithm that generates the correct pattern of each word and deals with the orthographic issues of the Arabic language and other word derivation issues, such as the elimination or substitution of root letters.
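To make the notion of a word's pattern concrete, here is a deliberately naive sketch and not the novel algorithm mentioned above. It assumes an undiacritised word and a triliteral root, and simply replaces the root letters, matched greedily from left to right, with the conventional placeholders ف, ع and ل. The function name `naive_pattern` is hypothetical.

```python
from typing import Optional

# Conventional placeholders for the 1st, 2nd and 3rd root letters.
PLACEHOLDERS = "فعل"


def naive_pattern(word: str, root: str) -> Optional[str]:
    """Return the pattern of `word` for `root`, or None if the root
    letters do not all appear, in order, in the written word."""
    if len(root) != len(PLACEHOLDERS):
        raise ValueError("this sketch only handles triliteral roots")
    out = []
    next_root = 0  # index of the next root letter still to be matched
    for ch in word:
        if next_root < len(root) and ch == root[next_root]:
            out.append(PLACEHOLDERS[next_root])
            next_root += 1
        else:
            out.append(ch)  # non-root letters are kept as they are
    return "".join(out) if next_root == len(root) else None


print(naive_pattern("كاتب", "كتب"))    # regular root: all three letters are found
print(naive_pattern("مكتوب", "كتب"))
print(naive_pattern("قال", "قول"))     # root letter و is written as ا, so no match
```

The last call returns `None` because the middle root letter is substituted in the written word. This is exactly the kind of elimination or substitution of root letters that a plain letter-matching approach cannot handle.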
Related articles
Engineering Terminology - a Case for a Linguistically-Informed Terminology Database
Terminology databases of specialist domains contain a wealth of lexical, semantic and pragmatic data associated with each of the terms stored. However, until recently, the amount of syntactic and morphological data associated with each term is either nonexistent or entered as ad-hoc grammatical data, like compound nouns, adjectival nouns, etc. The recently completed EC-sponsored TRANSTERM proje...
Seat Usage Data Analysis and Its Application for Library Marketing
Seat Usage Data Analysis and Its Application for Library Marketing MDL: Metrics Definition Language p. 248 Natural Language Processing and Computational Linguistics A Statistical Global Feature Extraction Method for Optical Font Recognition p. 257 Domain N-Gram Construction and Its Application to Text Editor p. 268 Grounding Two Notions of Uncertainty in Modal Conditional Statements p. 278 Deve...
A Universal Feature Schema for Rich Morphological Annotation and Fine-Grained Cross-Lingual Part-of-Speech Tagging
Semantically detailed and typologically-informed morphological analysis that is broadly applicable cross-linguistically has the potential to improve many NLP applications, including machine translation, n-gram language models, information extraction, and co-reference resolution. In this paper, we present a universal morphological feature schema, which is a set of features that represent the fin...
Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...
Combining Morphological and Ngram Evidence for Monolingual Document Retrieval
We report on experiments in which we merged the results of linguistically informed and linguistically ignorant approaches to retrieval for European languages. We found that even high-quality base runs can be improved by means of fairly simple techniques for merging them with other runs, although the improvements no longer seem to be as dramatic as those reported on previous experiments on small...